6 research outputs found

A free/open-source hybrid morphological disambiguation tool for Kazakh

Author: Abduali Balzhan
Amirova Dina
Assylbekov Zhenisbek
Karibayeva Aidana
Nurkas Assulan
Sundetova Aida
Tyers Francis
Washington Jonathan
Publication venue: DOI: 10.13140/RG.2.2.12467.43045
Publication date: 01/04/2016

This paper presents the results of developing a morphological disambiguation tool for Kazakh. Starting with a previously developed rule-based approach, we tried to cope with the complex morphology of Kazakh by breaking up lexical forms across their derivational boundaries into inflectional groups and modeling their behavior with statistical methods. A hybrid rule-based/statistical approach appears to benefit morphological disambiguation, demonstrating a per-token accuracy of 91% in running text.

Nazarbayev University Repository

Technologies for aligning and extending parallel corpora of the Kazakh language

Author: Karibayeva Aidana
Rakhimova Diana
Publication venue: 'Private Company Technology Center'
Publication date: 31/08/2022

The paper presents a two-stage alignment method and an extension method for parallel corpora of the Kazakh language. Kazakh is an agglutinative language with rich morphology and belongs to the Turkic language group, so the traditional alignment methods for similar languages do not work for it. Alignment is used primarily to ensure that the fragment corresponding to the original is found in the translation. After that, the matching fragments of the parallel texts are compared with each other. At the initial stage, the question is what needs to be aligned. It is possible to align word by word, but this often becomes almost impossible because the sets of lexemes and expressions do not match across languages. Given the linguistic peculiarities of individual languages, existing technologies for universal alignment of parallel text may not work for agglutinative languages, in which word forms are built with additional affixes and auxiliary words that carry semantic and morphological information. The approach presented in this paper is a two-stage alignment that uses a bilingual dictionary of synonyms. Evaluation on an English-Kazakh corpus shows that the method achieves 89% correct alignment on average. The second method is designed to expand the parallel corpus, since good-quality natural parallel corpora for the Kazakh-English language pair are lacking. It uses a combinatorial technique that takes into account the semantic and grammatical features of the Kazakh language: different Kazakh tenses are used for sentence generation, and different endings for parts of speech are also considered.
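
To make the idea of dictionary-assisted alignment for an agglutinative language more concrete, here is a minimal Python sketch. It is not the authors' two-stage method, whose details the abstract does not give. It only illustrates one plausible ingredient: matching a suffixed Kazakh word to a dictionary stem by prefix, then linking it to an English token listed among that stem's synonyms. The dictionary `SYNONYMS`, the helper names `find_stem` and `align`, and the example sentence pair are all hypothetical.

```python
# Toy dictionary-based word alignment. All data below is hypothetical.
# Keys are Kazakh stems; values are sets of English equivalents (synonyms).
SYNONYMS = {
    "бала": {"child", "children", "kid", "kids"},
    "кітап": {"book", "books"},
    "оқы": {"read", "reads", "study"},
}


def find_stem(word, stems):
    """Return the longest dictionary stem that is a prefix of `word`, or None."""
    matches = [s for s in stems if word.startswith(s)]
    return max(matches, key=len) if matches else None


def align(kazakh_tokens, english_tokens):
    """Pair each Kazakh token with the first unused English token that the
    dictionary lists for its stem. Returns a list of (kk_index, en_index)."""
    used = set()
    pairs = []
    for i, kk in enumerate(kazakh_tokens):
        stem = find_stem(kk.lower(), SYNONYMS)
        if stem is None:
            continue  # no dictionary entry: leave the token unaligned
        for j, en in enumerate(english_tokens):
            if j not in used and en.lower() in SYNONYMS[stem]:
                pairs.append((i, j))
                used.add(j)
                break
    return pairs


# Illustrative sentence pair (assumed for this sketch, not from the paper).
kk = "Балалар кітаптарды оқыды".split()
en = "The children read the books".split()
links = align(kk, en)
```

In this toy case `links` pairs "Балалар" with "children", "кітаптарды" with "books", and "оқыды" with "read". Prefix matching is a crude stand-in for real morphological analysis and would fail wherever a stem changes under suffixation.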

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Eastern-European Journal of Enterprise Technologies

Neural machine translation system for the Kazakh language based on synthetic corpora

Author: Abduali Balzhan
Karibayeva Aidana
Tukeyev Ualsher
Publication venue: 'EDP Sciences'
Publication date: 01/01/2019

Large parallel corpora are lacking for the Kazakh language, which seriously impairs the quality of machine translation from and into Kazakh. This article considers neural machine translation of Kazakh on the basis of synthetic corpora. Kazakh belongs to the Turkic languages, which are characterised by rich morphology, and neural machine translation requires large amounts of training data. The article presents a model for creating synthetic corpora, namely the generation of sentences based on the complete system of suffixes of the Kazakh language, which is the novelty of the approach. Using the generated synthetic corpora improves translation quality in neural machine translation for the Kazakh-English and Kazakh-Russian pairs.

EDP Sciences OAI-PMH repository (1.2.0)

Directory of Open Access Journals

Semantic Connections in the Complex Sentences for Post-Editing Machine Translation in the Kazakh Language

Author: Aidana Karibayeva
Aliya Turganbayeva
Asem Turarbek
Diana Rakhimova
Vladislav Karyukin
Publication venue: MDPI AG
Publication date: 01/08/2022

Machine translation continues to pose problems. While the most advanced translation platforms, such as Google and Yandex, allow for high-quality translations of languages with simple grammatical structures, morphologically richer languages still suffer when complex sentences are translated, and translation services leave many structural errors. This study focused on designing rules for the grammatical structures of complex sentences in the Kazakh language, which has a difficult grammar with many rules. First, the types of complex sentences in the Kazakh language were thoroughly observed with the use of templates from the FuzzyWuzzy library. Then, the correction of complex sentences was completed with parallel corpora. The sentences were translated into English and Russian by existing machine translation systems. The grammar of both the Kazakh–English and Kazakh–Russian language pairs was therefore considered, and both pairs used rules specifically designed for the post-editing steps. Finally, the performance of the developed algorithm was evaluated with an accuracy score for each language pair. This approach was then proposed for use in other corpora generation, post-editing, and analysis systems in future works.

Directory of Open Access Journals

Semantic Connections in the Complex Sentences for Post-Editing Machine Translation in the Kazakh Language

Author: Aidana Karibayeva
Aliya Turganbayeva
Asem Turarbek
Diana Rakhimova
Vladislav Karyukin
Publication venue: 'MDPI AG'
Publication date: 30/08/2022

Multidisciplinary Digital Publishing Institute